As we learned in class, LD is the
non-random assortment of alleles at different loci across the genome.
Linked alleles will frequently travel with each other during crossover
events during meiosis, and if we know LD is high between two alleles is
present in a population we can even use the frequency of one allele to
predict the frequency of the other allele. Today, we will look at LD
within UCP1 in our own populations to get both a sense of the
amount of linkage present within the gene, and what that linkage can
tell us about our assigned population’s history.
In the Ramos
et al. (2012) paper that you read for this module, they
looked at 3 SNPs within UCP1 that were thought to be associated
with obesity: rs6536991 (4:140560427), rs2270565
(4:140562317) and rs12502572 (4:140563980). The results of
their analyses suggested that only rs6536991 was directly
associated with obesity in the Brazilian population under investigation.
The other two alleles, however - rs2270565 and
rs12502572 - were found to be in various levels of linkage with
rs6536991. We will take a closer look at these three SNPs in
our own populations, as well as two other SNPs in UCP1 called
rs3811787 (4:140569265) and the famous rs1800592
(4:140572807), which Cha
et al. (2008) found to be associated with abdominal fat
stores in a population of Korean women.
We will be using two statistics to gauge LD in our populations:
Lewontin’s D’ (pronounced ‘D prime’) and
r2 (pronounced ‘r-squared’). Both
of these are useful for determining the amount of linkage between two
SNPs, but each statistic tells us something slightly different.
D’ is slightly easier to understand, as it simply is a
measure of the predictability of one SNP’s genotype based on the other.
When D’ = 1, the two SNPs are in perfect LD (meaning that they
will always co-segregate, or be inherited together as a unit),
and when D’ = 0 the two SNPs are in Linkage Equilibrium
(meaning that co-segregation is random, or around 50%).
r2 is a little bit different. This statistic will also take into account the frequency of the allele in question. If one SNP genotype is linked to another, but the linked genotype of one SNP is the minor allele (less common than the other genotype), the r2 value will be lower than the D’ value. This does not make the SNPs any less linked, but is rather taking into account the (lower) allele frequency.
Learn about the SNPs rs6536991, rs2270565, rs12502572, rs3811787, and rs1800592 and their roles in human obesity.
Learn how to use various R packages to perform LD analysis in a population, including constructing LD matrices and LD heatmaps.
Learn about the two statistics D’ and r2, which are the most commonly used statistics to evaluate LD between SNPs. Learn what each can tell us about a population, and apply the two statistics to our own populations.
Learn how to use EnsEMBL to look at long-distance LD
between SNPs in UCP1, as well as for SNPs in other genes.
Log in to the SCC On Demand and bring up the R Studio window like we did in the last module:
Now that we’re in R, we can prepare our data for analysis.
The functions we will use today require our VCF data to be in a special
format called a SNPMatrix, so we’ll convert our data in to a
SNPMatrix now. As usual, I will use the population YRI again as
an example; change all YRI code to your population code!
First, we’ll need to install a few packages (remember to always say ‘yes’ to downloading dependencies!):
install.packages("devtools")
library(devtools)
install.packages("BiocManager")
library(BiocManager)
BiocManager::install("snpStats")
BiocManager::install("VariantAnnotation")
devtools::install_github("SFUStatgen/LDheatmap")
devtools::install_github("felix-clark/gpart")
install.packages("geneHapR")
And then we’ll load the packages we will be using:
library(tidyverse)
library(snpStats)
library(vcfR)
library(LDheatmap)
Next, we will load in our data, in this case the VCF file of the
extended area of SNPs in UCP1 (for me called
UCP1_YRI_REGION.vcf) in to our R
space first, and then use another function to transform them into
SNPMatrices.
We’ll start with the full UCP1 file that you made for your
Homework for Module 1 (which should include 5,000 or 10,000 bp up- and
down-stream of the gene region itself. Remember to replace
YRI with the acronym for your population, and
caschmit with your own named directory:
UCP1vcf <- read.vcfR("/projectnb/anth333/caschmit/UCP1_YRI_REGION.vcf")
## Scanning file to determine attributes.
## File attributes:
## meta lines: 93
## header_line: 94
## variant count: 150
## column count: 117
## Meta line 93 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
## Character matrix gt rows: 150
## Character matrix gt cols: 117
## skip: 0
## nrows: 150
## row_num: 0
## Processed variant: 150
## All variants processed
Next, we’ll convert it to a SNPMatrix. Let’s take a look:
UCP1matrix <- vcfR2SnpMatrix(UCP1vcf)
UCP1matrix #repeating the file name will show you what it looks like
## $genetic.distances
## [1] 140554647 140554683 140554886 140555116 140555173 140555183 140555439
## [8] 140555474 140555490 140555558 140555670 140555762 140555842 140556119
## [15] 140556351 140556812 140556886 140556943 140557061 140557239 140557309
## [22] 140557353 140557372 140557496 140557525 140557682 140557788 140557941
## [29] 140558329 140558375 140558513 140558578 140558769 140558845 140559713
## [36] 140559713 140559846 140559997 140560049 140560049 140560366 140560427
## [43] 140560607 140560687 140560748 140560748 140560748 140560748 140561152
## [50] 140561330 140561357 140561527 140561631 140561674 140562317 140562581
## [57] 140562627 140562887 140563217 140563611 140563611 140563980 140564169
## [64] 140564332 140564594 140564853 140564891 140564982 140565006 140565178
## [71] 140565377 140565390 140565830 140565846 140566542 140566594 140566801
## [78] 140566828 140567578 140567641 140567738 140567914 140567935 140568016
## [85] 140568045 140568074 140568117 140568333 140568763 140568792 140568842
## [92] 140568861 140568909 140568998 140569040 140569219 140569265 140569304
## [99] 140569481 140569496 140569510 140569826 140569931 140569940 140569941
## [106] 140570045 140570066 140570422 140570429 140570542 140570604 140570619
## [113] 140570765 140570776 140570850 140570901 140570919 140570972 140571067
## [120] 140571085 140571190 140571295 140571482 140571531 140571583 140571632
## [127] 140571789 140571800 140571850 140571852 140572036 140572112 140572685
## [134] 140572723 140572807 140572930 140572950 140573074 140573151 140573322
## [141] 140573322 140573322 140573322 140573346 140573371 140573392 140573402
## [148] 140573493 140573712 140573773
##
## $subjectID
## [1] "NA18486" "NA18488" "NA18489" "NA18498" "NA18499" "NA18501" "NA18502"
## [8] "NA18504" "NA18505" "NA18507" "NA18508" "NA18510" "NA18511" "NA18516"
## [15] "NA18517" "NA18519" "NA18520" "NA18522" "NA18523" "NA18853" "NA18856"
## [22] "NA18858" "NA18861" "NA18864" "NA18865" "NA18867" "NA18868" "NA18870"
## [29] "NA18871" "NA18873" "NA18874" "NA18876" "NA18877" "NA18878" "NA18879"
## [36] "NA18881" "NA18907" "NA18908" "NA18909" "NA18910" "NA18912" "NA18915"
## [43] "NA18916" "NA18917" "NA18923" "NA18924" "NA18933" "NA18934" "NA19092"
## [50] "NA19093" "NA19095" "NA19096" "NA19098" "NA19099" "NA19102" "NA19107"
## [57] "NA19108" "NA19113" "NA19114" "NA19116" "NA19117" "NA19118" "NA19119"
## [64] "NA19121" "NA19129" "NA19130" "NA19131" "NA19137" "NA19138" "NA19141"
## [71] "NA19143" "NA19144" "NA19146" "NA19147" "NA19149" "NA19152" "NA19153"
## [78] "NA19159" "NA19160" "NA19171" "NA19172" "NA19175" "NA19184" "NA19185"
## [85] "NA19189" "NA19190" "NA19197" "NA19198" "NA19200" "NA19201" "NA19204"
## [92] "NA19206" "NA19207" "NA19209" "NA19210" "NA19213" "NA19214" "NA19222"
## [99] "NA19223" "NA19225" "NA19235" "NA19236" "NA19238" "NA19239" "NA19247"
## [106] "NA19248" "NA19256" "NA19257"
##
## $data
## An XSnpMatrix with 216 rows (216 haploid and 0 diploid), and 150 columns
## Row names: NA18486.1 ... NA19257.2
## Col names: 4:140554647:CTG:C ... 4:140573773:A:G
You may notice that the output of the UCP1matrix gives us
some information about our data. Under $genetic.distances
you’ll see that our SNPMatrix includes physical positions on
the chromosome, and that our $subjectID is comprised of our
sample names, while our genotype data are contained in the
$data slot. You can use this information to count both how
many SNPs and how many individuals are in your database. In my case,
I’ve got 108 individuals in the YRI population, typed at 150 variants.
Your population and variant count might differ, as when we created out
VCF files we restricted to one sample population, and also cut out
variant sites that were homogenous (i.e., had no variation across our
individual populations).
ld FunctionNext, we will be using a function from the {snpStats} package to look
at LD across the UCP1 gene. To do this, we will use the
ld function to create an LD matrix, which we will then
graph to comprehend better. First, we’ll use the ld
function, like so (you should set the depth number to one
value less than the number of variants in your VCF file):
#For "Stats," use "R.squared" for now. We will explore D' later.
LD <- snpStats::ld(UCP1matrix$data, depth = 149, stats = "R.squared")
If you type LD into your console, you’ll see the
large matrix of r2 LD values you’ve just created, which is
probably fairly confusing on first look. An LD analysis output like this
can be hard to interpret without a visual aid, so let’s build one! The
output of this function (our LD object) is what we’ll use
as an input to visualize these LD values.
#sets color scale for the graph
cols = colorRampPalette(c("yellow", "red"))(10)
#building the image
image(LD, lwd = 0, cuts= 9, col.regions=cols, colorkey=TRUE)
You can see from the figure here in the module that much of this gene in the YRI population shows very low LD (consulting the scale on the right, that would be the yellow regions), meaning that these regions are segregating randomly. The regions in red are those regions that have very high LD, suggesting non-random segregation. Large, contiguous regions of red are what we might call linkage blocks, or larger contiguous areas along the genome with consistently high linkage. What could this mean about these regions of the gene?
Now, we will look at our smaller dataset so we can focus on five SNPs
in particular: rs6536991 (4:140560427), rs2270565
(4:140562317), rs12502572 (4:140563980), rs3811787
(4:140569265), and rs1800592 (4:140572807). Now, three of these
you read about in the homework as being either SNPs related to obesity,
or in moderate linkage with obesity-related SNPs, in a Brazilian
population; while the final two were associated with abdominal obesity
in a Korean population of women. Let’s see what they look like in
our populations.
What we will be doing now is
creating two LD heatmaps, one that displays the
r2 LD statistics and one that displays the
D’ LD statistics. As we discussed in the introduction, these
two statistics don’t tell us quite the same things about our data, so
comparing the two statistics can be useful in making conclusions about
LD in your populations.
Now, here is the code to create the LD heatmap showing the
r2 statistic (notice where I’ve entered the names of
our SNPs of interest, to note their locations on the heatmap):
R2heatmapUCP1 <- LDheatmap(UCP1matrix$data,
genetic.distances=UCP1matrix$genetic.distances,
distances="physical",
LDmeasure="r",
title="Pairwise LD in UCP1 for YRI using R^2",
add.map=TRUE, add.key=TRUE,
geneMapLocation=0.15,
SNP.name=c("4:140560427:T:C", "4:140562317:T:A",
"4:140563980:G:A", "4:140569265:T:G",
"4:140572807:T:C"),
color=NULL, newpage=TRUE,
name="UCP1 LD Heatmap (R^2)")
This looks nice, but it’s still pretty difficult to see LD blocks… if you want to prettify your LD heatmap, you can check out this resource here, which is where I got the code for the color modifications, below (which I find easier to read than grayscale or yellow/red). You can see the codes and how they modify the heatmap as I move along, below:
#This code establishes the color palette for the heatmap
rgb.palette <- colorRampPalette(rev(c("grey85", "green", "darkgreen")), space = "rgb")
#This code will save the subsequent plot
png(filename="R2heatmapUCP1_YRI.png",units="in",res=600,width=6,height=5)
#Here's the code for the heatmap itself
R2heatmapUCP1 <- LDheatmap(UCP1matrix$data,
genetic.distances=UCP1matrix$genetic.distances,
distances="physical",
LDmeasure="r",
title="Pairwise LD in UCP1 for YRI using R^2",
add.map=TRUE, add.key=TRUE,
geneMapLocation=0.15,
SNP.name=c("4:140560427:T:C", "4:140562317:T:A",
"4:140563980:G:A", "4:140569265:T:G",
"4:140572807:T:C"),
color=rgb.palette(18), newpage=TRUE,
name="R2heatmapUCP1")
#There is USUALLY a way to add the gene map to this... but UCSC Genome Browser recently changed they way their website works, so this is broken, for now.
#addGenes <- LDheatmap.addGenes(R2heatmapUCP1, chromosome="chr4",genome="hg38")
#addRecomb <- LDheatmap.addRecombRate(R2heatmapUCP1, chr="chr4", genome="hg38")
#Here we're editing the symbols
grid::grid.edit("symbols", pch = 16, gp = grid::gpar(cex = 0.5,col="darkgreen"))
#And here we're editing the SNP names
grid::grid.edit("SNPnames", gp = grid::gpar(cex=0.75,col="darkgreen"))
#and this closes the plot
dev.off()
We can also add in an outline of the gene region, if that helps you to imagine where exactly all this LD is occurring:
#This code establishes the color palette for the heatmap
rgb.palette <- colorRampPalette(rev(c("grey85", "green", "darkgreen")), space = "rgb")
#This code will save the subsequent plot
png(filename="R2heatmapUCP1_YRI_Gene.png",units="in",res=600,width=6,height=5)
#Here's the code for the heatmap itself
R2heatmapUCP1 <- LDheatmap(UCP1matrix$data,
genetic.distances=UCP1matrix$genetic.distances,
distances="physical",
LDmeasure="r",
title="Pairwise LD in UCP1 for YRI using R^2",
add.map=TRUE, add.key=TRUE,
geneMapLocation=0.15,
SNP.name=c("4:140560427:T:C", "4:140562317:T:A",
"4:140563980:G:A", "4:140569265:T:G",
"4:140572807:T:C"),
color=rgb.palette(18), newpage=TRUE,
name="R2heatmapUCP1")
#There is USUALLY a way to add the gene map to this... but UCSC Genome Browser recently changed they way their website works, so this is broken, for now.
#addGenes <- LDheatmap.addGenes(R2heatmapUCP1, chromosome="chr4",genome="hg38")
#addRecomb <- LDheatmap.addRecombRate(R2heatmapUCP1, chr="chr4", genome="hg38")
#Highlight UCP1 gene region... remember, it's 4:140559431-140568961. In my SNPMatrix, this translates to rows 35-93; it may translate to *different* rows in your own!:
LDheatmap.highlight(R2heatmapUCP1, i = 35, j = 93, fill = "NA", col = "black", lwd = 0.5,lty = 1,flipOutline=FALSE, crissCross = TRUE)
#Here we're editing the symbols
grid::grid.edit("symbols", pch = 16, gp = grid::gpar(cex = 0.5,col="darkgreen"))
#And here we're editing the SNP names
grid::grid.edit("SNPnames", gp = grid::gpar(cex=0.75,col="darkgreen"))
#and this closes the plot
dev.off()
As you can see on first glance, there is potentially one large linkage block in the UCP1 region of the YRI population, starting downstream (beyond the end) of the gene region. Remember, UCP1 is on the reverse strand, so the gene is read from higher positions to lower positions, meaning that positions lower than the lowest numner in the gene’s position range are are downstream of the gene). It also looks like there may be a few small regions of LD in the upstream region of the gene, near our regulatory SNPs from the Cha et al. (2008) paper… one immediately upstream of the gene (in what may be the promoter region), and one much further.
There are a few things we can do to figure out exactly where these significant regions of LD actually are beyond just eyeballing them…
Maybe the easiest is a tool in the {gpart} package called
BigLD, which will help us statistically identify the
positions of independent linkage blocks in our gene region:
library(gpart)
ucp1_res = BigLD(genofile = "/projectnb/anth333/caschmit/UCP1_YRI_REGION.vcf",LD="r2")
ucp1_res
## CLQ done!constructLDblock done!
## chr start.index end.index start.rsID end.rsID start.bp
## 1 chr4 2 91 4:140554683:A:G 4:140568842:T:G 140554683
## 2 chr4 98 102 4:140569304:G:A 4:140569826:C:T 140569304
## 3 chr4 107 108 4:140570066:G:T 4:140570422:C:A 140570066
## 4 chr4 124 150 4:140571531:G:C 4:140573773:A:G 140571531
## end.bp
## 1 140568842
## 2 140569826
## 3 140570422
## 4 140573773
As you can see, {gpart} has identified four significant
blocks of LD in the YRI population across our UCP1 gene region…
let’s take a look at where these regions are by mapping them onto our
LDheatmap object:
#This code establishes the color palette for the heatmap
rgb.palette <- colorRampPalette(rev(c("grey85", "green", "darkgreen")), space = "rgb")
#This code will save the subsequent plot (notice the name change!)
png(filename="R2heatmapUCP1_YRI_LDblocks.png",units="in",res=600,width=6,height=5)
#Here's the code for the heatmap itself (nothing's changed here):
R2heatmapUCP1 <- LDheatmap(UCP1matrix$data,
genetic.distances=UCP1matrix$genetic.distances,
distances="physical",
LDmeasure="r",
title="Pairwise LD in UCP1 for YRI using R^2",
add.map=TRUE, add.key=TRUE,
geneMapLocation=0.15,
SNP.name=c("4:140560427:T:C", "4:140562317:T:A",
"4:140563980:G:A", "4:140569265:T:G",
"4:140572807:T:C"),
color=rgb.palette(18), newpage=TRUE,
name="R2heatmapUCP1")
#There is USUALLY a way to add the gene map to this... but UCSC Genome Browser recently changed they way their website works, so this is broken, for now.
#addGenes <- LDheatmap.addGenes(R2heatmapUCP1, chromosome="chr4",genome="hg38")
#addRecomb <- LDheatmap.addRecombRate(R2heatmapUCP1, chr="chr4", genome="hg38")
#Highlight UCP1 gene region... remember, it's 4:140559431-140568961. In my SNPMatrix, this translates to rows 35-93; it may translate to *different* rows in your own!:
LDheatmap.highlight(R2heatmapUCP1, i = 35, j = 93, fill = "NA", col = "black", lwd = 0.5,lty = 1,flipOutline=FALSE, crissCross = TRUE)
#Here we're adding our linkage blocks using the start/end index numbers in the {gpart} output, and choosing a different color for each block:
#Block 1:
LDheatmap.highlight(R2heatmapUCP1, i = 2, j = 91, fill = "NA", col = "gold", lwd = 2,lty = 1,flipOutline=FALSE, crissCross = FALSE)
#Block 2:
LDheatmap.highlight(R2heatmapUCP1, i = 98, j = 102, fill = "NA", col = "gold3", lwd = 2,lty = 1,flipOutline=FALSE, crissCross = FALSE)
#Block 3:
LDheatmap.highlight(R2heatmapUCP1, i = 107, j = 108, fill = "NA", col = "gold4", lwd = 2,lty = 1,flipOutline=FALSE, crissCross = FALSE)
#Block 4:
LDheatmap.highlight(R2heatmapUCP1, i = 124, j = 150, fill = "NA", col = "goldenrod", lwd = 2,lty = 1,flipOutline=FALSE, crissCross = FALSE)
#Here we're editing the symbols
grid::grid.edit("symbols", pch = 16, gp = grid::gpar(cex = 0.5,col="darkgreen"))
#And here we're editing the SNP names
grid::grid.edit("SNPnames", gp = grid::gpar(cex=0.75,col="darkgreen"))
#and this closes the plot
dev.off()
## quartz_off_screen
## 2
The first, which runs from positions 140554683-140568842, appears to represent that first block we can visually identify in the downstream region of the heatmap; it contains our three SNPs we learned about from Ramos et al., and indeed also contains almost the entire gene region (remember, UCP1 runs from 140559431-140568961 on the reverse strand) and its downstream enhancer region (which runs from 140556002-140559199).
The second linkage block, which runs from 140569304-140569826, is immediately upstream of the coding region for UCP1, and may represent the promoter, which runs from 140568951-140569599. How do I know that’s where the promoter is? It’s on EnsEMBL! You can see the promoter, enhancers, and their associated positions in the Region in Detail view on the human UCP1 gene page. As you can see, the -412A>C variant site (rs3811787) from the Cha et al. (2008) paper is right near the promoter, which might explain how it affects UCP1 function: if one of the variants at that locus alters how the promoter initiates transcription of UCP1, it could change UCP1 expression.
Although the other two linkage blocks further upstream from the gene are not annotated on EnsEMBL, the final locus from the Cha et al. (2008) paper also gives us a clue regarding what these linkage blocks might represent: the larger of the two contains the famous regulatory locus Cha et al. identified as also having a large impact on abdominal obesity in Koren women, -3826A>G (rs1800592), suggesting that these may also be either full regulatory regions for UCP1 that have not yet been identified as such, or that there may be LD in these areas due to selection on the regulatory effects of variants in these regions on UCP1 function.
Next, we’ll calculate and visualize LD using the D’
statistic:
#This code establishes the color palette for the heatmap
rgb.palette <- colorRampPalette(rev(c("grey85", "pink", "red")), space = "rgb")
#This code will save the subsequent plot
png(filename="DheatmapUCP1_YRI.png",units="in",res=600,width=6,height=5)
#Here's the code for the heatmap itself
DheatmapUCP1 <- LDheatmap(UCP1matrix$data,
genetic.distances=UCP1matrix$genetic.distances,
distances="physical",
LDmeasure="D",
title="Pairwise LD in UCP1 for YRI using D'",
add.map=TRUE, add.key=TRUE,
geneMapLocation=0.15,
SNP.name=c("4:140560427:T:C", "4:140562317:T:A",
"4:140563980:G:A", "4:140569265:T:G",
"4:140572807:T:C"),
color=rgb.palette(18), newpage=TRUE,
name="UCP1 LD Heatmap (D')")
#There is USUALLY a way to add the gene map to this... but UCSC Genome Browser recently changed they way their website works, so this is broken, for now.
#addGenes <- LDheatmap.addGenes(R2heatmapUCP1, chromosome="chr4")
#addRecomb <- LDheatmap.addRecombRate(R2heatmapUCP1, chr="chr4", genome="hg38")
#Highlight UCP1 gene region... remember, it's 4:140559431-140568961. In my SNPMatrix, this translates to rows 35-93; it may translate to *different* rows in your own!:
LDheatmap.highlight(DheatmapUCP1, i = 35, j = 93, fill = "NA", col = "black", lwd = 0.5,lty = 1,flipOutline=FALSE, crissCross = TRUE)
#Here we're editing the symbols
grid.edit("symbols", pch = 16, gp = gpar(cex = 0.5,col="red"))
#And here we're editing the SNP names
grid.edit("SNPnames", gp = gpar(cex=0.75,col="red"))
#and this closes the plot
dev.off()
Notice that measuring linkage using D’ appears to be fundamentally different… it looks like there is high LD across the entire chosen gene region!
Let’s take a look using {gpart}:
ucp1_res1 = BigLD(genofile = "/projectnb/anth333/caschmit/UCP1_REGION_YRI.vcf", LD="Dprime")
ucp1_res1
## CLQ done!constructLDblock done!
## chr start.index end.index start.rsID end.rsID
## 1 chr4 2 78 4:140554683:A:G 4:140566828:C:G
## 2 chr4 79 80 4:140567578:A:C 4:140567641:G:A
## 3 chr4 97 98 4:140569265:T:G 4:140569304:G:A
## 4 chr4 100 102 4:140569496:C:T 4:140569826:C:T
## 5 chr4 107 108 4:140570066:G:T 4:140570422:C:A
## 6 chr4 113 139 4:140570765:G:A 4:140573151:A:G
## 7 chr4 140 140 4:140573322:C:CTTTTTTT 4:140573322:C:CTTTTTTT
## 8 chr4 145 150 4:140573371:G:A 4:140573773:A:G
## start.bp end.bp
## 1 140554683 140566828
## 2 140567578 140567641
## 3 140569265 140569304
## 4 140569496 140569826
## 5 140570066 140570422
## 6 140570765 140573151
## 7 140573322 140573322
## 8 140573371 140573773
Here, we can see that, using the D’ measure, the
BigLD algorithm has divided the UCP1 genomic
region for the YRI population into 8 distinct linkage blocks, giving us
both starting and ending positions for each. Using the row number (1-8)
to name them, which linkage block(s) contain our SNPs of interest? Are
any of our SNPs now in different linkage blocks than before?
Let’s take a look (for this plot, I’ll make the highlighted linkage blocks solid so that they can be seen better):
#This code establishes the color palette for the heatmap
rgb.palette <- colorRampPalette(rev(c("grey85", "pink", "red")), space = "rgb")
#This code will save the subsequent plot
png(filename="DheatmapUCP1_YRI_LDblocks.png",units="in",res=600,width=6,height=5)
#Here's the code for the heatmap itself
DheatmapUCP1 <- LDheatmap(UCP1matrix$data,
genetic.distances=UCP1matrix$genetic.distances,
distances="physical",
LDmeasure="D",
title="Pairwise LD in UCP1 for YRI using D'",
add.map=TRUE, add.key=TRUE,
geneMapLocation=0.15,
SNP.name=c("4:140560427:T:C", "4:140562317:T:A",
"4:140563980:G:A", "4:140569265:T:G",
"4:140572807:T:C"),
color=rgb.palette(18), newpage=TRUE,
name="UCP1 LD Heatmap (D')")
#There is USUALLY a way to add the gene map to this... but UCSC Genome Browser recently changed they way their website works, so this is broken, for now.
#addGenes <- LDheatmap.addGenes(R2heatmapUCP1, chromosome="chr4")
#addRecomb <- LDheatmap.addRecombRate(R2heatmapUCP1, chr="chr4", genome="hg38")
#Highlight UCP1 gene region... remember, it's 4:140559431-140568961. In my SNPMatrix, this translates to rows 35-93; it may translate to *different* rows in your own!:
LDheatmap.highlight(DheatmapUCP1, i = 35, j = 93, fill = "NA", col = "black", lwd = 0.5,lty = 1,flipOutline=FALSE, crissCross = TRUE)
#Here we're adding our linkage blocks using the start/end index numbers in the {gpart} output, and choosing a different color for each block:
#Block 1:
LDheatmap.highlight(R2heatmapUCP1, i = 2, j = 78, fill = "gold", col = "gold", lwd = 2,lty = 1,flipOutline=FALSE, crissCross = FALSE)
#Block 2:
LDheatmap.highlight(R2heatmapUCP1, i = 79, j = 80, fill = "gold3", col = "gold3", lwd = 2,lty = 1,flipOutline=FALSE, crissCross = FALSE)
#Block 3:
LDheatmap.highlight(R2heatmapUCP1, i = 97, j = 98, fill = "gold4", col = "gold4", lwd = 2,lty = 1,flipOutline=FALSE, crissCross = FALSE)
#Block 4:
LDheatmap.highlight(R2heatmapUCP1, i = 100, j = 102, fill = "goldenrod", col = "goldenrod", lwd = 2,lty = 1,flipOutline=FALSE, crissCross = FALSE)
#Block 5:
LDheatmap.highlight(R2heatmapUCP1, i = 107, j = 108, fill = "gold", col = "gold", lwd = 2,lty = 1,flipOutline=FALSE, crissCross = FALSE)
#Block 6:
LDheatmap.highlight(R2heatmapUCP1, i = 113, j = 139, fill = "gold3", col = "gold3", lwd = 2,lty = 1,flipOutline=FALSE, crissCross = FALSE)
#Block 7:
LDheatmap.highlight(R2heatmapUCP1, i = 140, j = 141, fill = "gold4", col = "gold4", lwd = 2,lty = 1,flipOutline=FALSE, crissCross = FALSE)
#Block 8:
LDheatmap.highlight(R2heatmapUCP1, i = 145, j = 150, fill = "goldenrod", col = "goldenrod", lwd = 2,lty = 1,flipOutline=FALSE, crissCross = FALSE)
#Here we're editing the symbols
grid.edit("symbols", pch = 16, gp = gpar(cex = 0.5,col="red"))
#And here we're editing the SNP names
grid.edit("SNPnames", gp = gpar(cex=0.75,col="red"))
#and this closes the plot
dev.off()
Now you’ve got several graphs that you can use to investigate LD in your population’s gene region! You can see specifically whether or not the SNPs of interest (the labelled SNPs) are in LD by finding where on the graph their paths intersect, or by looking for their linkage block by position in the LD block analysis output.
Now, for the purpose of reporting your pairwise LD results, we can
look at something called an LD matrix, which is automatically
generated by the LDheatmap function. This is a matrix where
the row and column names are SNP ID numbers, and the cell at the
intersection of a row and a column will tell you the LD statistic for
those two particular SNPs. For our purposes, you do not need to look at
the whole chart, you just need to find the intersections of our five
SNPs of interest to see if any of them are in LD, and what that LD value
is for each respective statistical test.
To view your reduced
LD matrices, run these two chunks of code separately:
#for R squared LD heatmap:
R2LD<-as.data.frame(R2heatmapUCP1$LDmatrix)
snp<-rownames(R2LD)
R2LD<-
R2LD %>%
mutate(snp = snp) %>%
dplyr::select(snp,"4:140560427:T:C","4:140562317:T:A","4:140563980:G:A","4:140569265:T:G","4:140572807:T:C") %>%
filter(snp == "4:140560427:T:C" |
snp == "4:140562317:T:A" |
snp == "4:140563980:G:A" |
snp == "4:140569265:T:G" |
snp == "4:140572807:T:C")
R2LD
## snp 4:140560427:T:C 4:140562317:T:A 4:140563980:G:A
## 4:140560427:T:C 4:140560427:T:C NA 0.02744425 0.521440823
## 4:140562317:T:A 4:140562317:T:A NA NA 0.006763973
## 4:140563980:G:A 4:140563980:G:A NA NA NA
## 4:140569265:T:G 4:140569265:T:G NA NA NA
## 4:140572807:T:C 4:140572807:T:C NA NA NA
## 4:140569265:T:G 4:140572807:T:C
## 4:140560427:T:C 0.017755682 7.763975e-05
## 4:140562317:T:A 0.009433962 6.445564e-03
## 4:140563980:G:A 0.060079444 1.283954e-03
## 4:140569265:T:G NA 1.093168e-01
## 4:140572807:T:C NA NA
#for D prime LD heatmap:
DpLD<-as.data.frame(DheatmapUCP1$LDmatrix)
snp<-rownames(DpLD)
DpLD<-
DpLD %>%
mutate(snp = snp) %>%
dplyr::select(snp,"4:140560427:T:C","4:140562317:T:A","4:140563980:G:A","4:140569265:T:G","4:140572807:T:C") %>%
filter(snp == "4:140560427:T:C" |
snp == "4:140562317:T:A" |
snp == "4:140563980:G:A" |
snp == "4:140569265:T:G" |
snp == "4:140572807:T:C")
DpLD
## snp 4:140560427:T:C 4:140562317:T:A 4:140563980:G:A
## 4:140560427:T:C 4:140560427:T:C NA 1 1
## 4:140562317:T:A 4:140562317:T:A NA NA 1
## 4:140563980:G:A 4:140563980:G:A NA NA NA
## 4:140569265:T:G 4:140569265:T:G NA NA NA
## 4:140572807:T:C 4:140572807:T:C NA NA NA
## 4:140569265:T:G 4:140572807:T:C
## 4:140560427:T:C 0.1562500 0.01818182
## 4:140562317:T:A 1.0000000 1.00000000
## 4:140563980:G:A 0.2894737 0.03670669
## 4:140569265:T:G NA 0.40000000
## 4:140572807:T:C NA NA
Does LD appear to correlate with physical distance among these SNPs, according to your LDheatmap images?
NOTE: I added this section as an experiment for the Fall 2024 semester. If you cannot get it to work, don’t worry! There are no homework questions related to this section, in case the coding is a disaster!
Remember that when we think about LD, we can also think about haplotypes. A haplotype is a set of DNA variants along a single chromosome that tend to be inherited together. Given this definition, we can think of the linkage blocks we’ve identified as potential haplotypes. These are important because, since they tend to be inherited together recombination between them is rare; this, in turn, is important because it compels us to ask: why is recombination between alleles in this block rare?
Haplotypes can also be useful for conducting downstream geneti analyses that we’ll become more familiar with later in the course, including phylogenetic analyses, because these random changes that stick in haplotypes can be used to trace population relatedness over evolutionary time.
Of course, the first step to analyzing haplotypes is identifying them… let’s get started!
We’ll be using a package called {geneHapR} to quickly and easily identify haplotypes in our study populations from the 1000 Genomes Project.
library(geneHapR)
Luckily for us, we can use our same imported VCF file for this analysis, but we do need to import one more piece of data: a GFF file. The GFF file is a file that contains gene annotation data; in other words, it indicates which positions are associated with which genes or other structures in the genome (i.e., introns, exons, regulatory regions, etc). I already downloaded a GFF file for chromosome 4 from EnsEMBL for us and uploaded it to the SCC. Let’s load it into R now:
#import GFF
chrom4_gff<-import_gff("/projectnb/anth333/1KG_Chrom4/Homo_sapiens.GRCh38.112.chromosome.4.gff3")
Now, in order to run the haplotypes, we need to get rid of duplicate POS calls in our VCF files. To do this, you may need to take a look at a portion of the VCF object:
UCP1vcf@fix
## CHROM POS ID REF ALT
## [1,] "chr4" "140554647" "4:140554647:CTG:C" "CTG" "C"
## [2,] "chr4" "140554683" "4:140554683:A:G" "A" "G"
## [3,] "chr4" "140554886" "4:140554886:A:G" "A" "G"
## [4,] "chr4" "140555116" "4:140555116:T:A" "T" "A"
## [5,] "chr4" "140555173" "4:140555173:AACT:A" "AACT" "A"
## [6,] "chr4" "140555183" "4:140555183:A:C" "A" "C"
## [7,] "chr4" "140555439" "4:140555439:T:G" "T" "G"
## [8,] "chr4" "140555474" "4:140555474:C:CT" "C" "CT"
## [9,] "chr4" "140555490" "4:140555490:CAG:C" "CAG" "C"
## [10,] "chr4" "140555558" "4:140555558:C:T" "C" "T"
## [11,] "chr4" "140555670" "4:140555670:A:G" "A" "G"
## [12,] "chr4" "140555762" "4:140555762:A:G" "A" "G"
## [13,] "chr4" "140555842" "4:140555842:G:A" "G" "A"
## [14,] "chr4" "140556119" "4:140556119:T:C" "T" "C"
## [15,] "chr4" "140556351" "4:140556351:C:G" "C" "G"
## [16,] "chr4" "140556812" "4:140556812:G:A" "G" "A"
## [17,] "chr4" "140556886" "4:140556886:G:A" "G" "A"
## [18,] "chr4" "140556943" "4:140556943:C:T" "C" "T"
## [19,] "chr4" "140557061" "4:140557061:G:C" "G" "C"
## [20,] "chr4" "140557239" "4:140557239:A:AT" "A" "AT"
## [21,] "chr4" "140557309" "4:140557309:T:C" "T" "C"
## [22,] "chr4" "140557353" "4:140557353:T:C" "T" "C"
## [23,] "chr4" "140557372" "4:140557372:G:A" "G" "A"
## [24,] "chr4" "140557496" "4:140557496:T:C" "T" "C"
## [25,] "chr4" "140557525" "4:140557525:C:G" "C" "G"
## [26,] "chr4" "140557682" "4:140557682:G:A" "G" "A"
## [27,] "chr4" "140557788" "4:140557788:CTTA:C" "CTTA" "C"
## [28,] "chr4" "140557941" "4:140557941:T:C" "T" "C"
## [29,] "chr4" "140558329" "4:140558329:A:C" "A" "C"
## [30,] "chr4" "140558375" "4:140558375:C:T" "C" "T"
## [31,] "chr4" "140558513" "4:140558513:CTACTGTGTG:C" "CTACTGTGTGT" "C"
## [32,] "chr4" "140558578" "4:140558578:T:G" "T" "G"
## [33,] "chr4" "140558769" "4:140558769:G:A" "G" "A"
## [34,] "chr4" "140558845" "4:140558845:T:C" "T" "C"
## [35,] "chr4" "140559713" "4:140559713:G:GA" "G" "GA"
## [36,] "chr4" "140559713" "4:140559714:GA:G" "GA" "G"
## [37,] "chr4" "140559846" "4:140559846:C:A" "C" "A"
## [38,] "chr4" "140559997" "4:140559997:A:G" "A" "G"
## [39,] "chr4" "140560049" "4:140560049:A:AT" "A" "AT"
## [40,] "chr4" "140560049" "4:140560050:AT:A" "AT" "A"
## [41,] "chr4" "140560366" "4:140560366:C:T" "C" "T"
## [42,] "chr4" "140560427" "4:140560427:T:C" "T" "C"
## [43,] "chr4" "140560607" "4:140560607:G:A" "G" "A"
## [44,] "chr4" "140560687" "4:140560687:T:G" "T" "G"
## [45,] "chr4" "140560748" "4:140560748:C:CAAA" "C" "CAAA"
## [46,] "chr4" "140560748" "4:140560749:C:CA" "C" "CA"
## [47,] "chr4" "140560748" "4:140560750:C:CAAAA" "C" "CAAAA"
## [48,] "chr4" "140560748" "4:140560751:C:CAA" "C" "CAA"
## [49,] "chr4" "140561152" "4:140561152:A:G" "A" "G"
## [50,] "chr4" "140561330" "4:140561330:CT:C" "CT" "C"
## [51,] "chr4" "140561357" "4:140561357:CCTT:C" "CCTT" "C"
## [52,] "chr4" "140561527" "4:140561527:A:T" "A" "T"
## [53,] "chr4" "140561631" "4:140561631:G:C" "G" "C"
## [54,] "chr4" "140561674" "4:140561674:C:A" "C" "A"
## [55,] "chr4" "140562317" "4:140562317:T:A" "T" "A"
## [56,] "chr4" "140562581" "4:140562581:A:C" "A" "C"
## [57,] "chr4" "140562627" "4:140562627:A:G" "A" "G"
## [58,] "chr4" "140562887" "4:140562887:CT:C" "CT" "C"
## [59,] "chr4" "140563217" "4:140563217:G:GA" "G" "GA"
## [60,] "chr4" "140563611" "4:140563611:A:AT" "A" "AT"
## [61,] "chr4" "140563611" "4:140563612:AT:A" "AT" "A"
## [62,] "chr4" "140563980" "4:140563980:G:A" "G" "A"
## [63,] "chr4" "140564169" "4:140564169:A:G" "A" "G"
## [64,] "chr4" "140564332" "4:140564332:T:C" "T" "C"
## [65,] "chr4" "140564594" "4:140564594:A:C" "A" "C"
## [66,] "chr4" "140564853" "4:140564853:C:CT" "C" "CT"
## [67,] "chr4" "140564891" "4:140564891:C:A" "C" "A"
## [68,] "chr4" "140564982" "4:140564982:G:C" "G" "C"
## [69,] "chr4" "140565006" "4:140565006:T:C" "T" "C"
## [70,] "chr4" "140565178" "4:140565178:A:G" "A" "G"
## [71,] "chr4" "140565377" "4:140565377:T:G" "T" "G"
## [72,] "chr4" "140565390" "4:140565390:G:A" "G" "A"
## [73,] "chr4" "140565830" "4:140565830:C:T" "C" "T"
## [74,] "chr4" "140565846" "4:140565846:AC:A" "AC" "A"
## [75,] "chr4" "140566542" "4:140566542:A:T" "A" "T"
## [76,] "chr4" "140566594" "4:140566594:T:C" "T" "C"
## [77,] "chr4" "140566801" "4:140566801:A:G" "A" "G"
## [78,] "chr4" "140566828" "4:140566828:C:G" "C" "G"
## [79,] "chr4" "140567578" "4:140567578:A:C" "A" "C"
## [80,] "chr4" "140567641" "4:140567641:G:A" "G" "A"
## [81,] "chr4" "140567738" "4:140567738:C:G" "C" "G"
## [82,] "chr4" "140567914" "4:140567914:C:T" "C" "T"
## [83,] "chr4" "140567935" "4:140567935:C:T" "C" "T"
## [84,] "chr4" "140568016" "4:140568016:T:C" "T" "C"
## [85,] "chr4" "140568045" "4:140568045:CCT:C" "CCT" "C"
## [86,] "chr4" "140568074" "4:140568074:C:T" "C" "T"
## [87,] "chr4" "140568117" "4:140568117:G:A" "G" "A"
## [88,] "chr4" "140568333" "4:140568333:A:G" "A" "G"
## [89,] "chr4" "140568763" "4:140568763:G:A" "G" "A"
## [90,] "chr4" "140568792" "4:140568792:CT:C" "CT" "C"
## [91,] "chr4" "140568842" "4:140568842:T:G" "T" "G"
## [92,] "chr4" "140568861" "4:140568861:G:A" "G" "A"
## [93,] "chr4" "140568909" "4:140568909:G:A" "G" "A"
## [94,] "chr4" "140568998" "4:140568998:C:T" "C" "T"
## [95,] "chr4" "140569040" "4:140569040:C:A" "C" "A"
## [96,] "chr4" "140569219" "4:140569219:C:T" "C" "T"
## [97,] "chr4" "140569265" "4:140569265:T:G" "T" "G"
## [98,] "chr4" "140569304" "4:140569304:G:A" "G" "A"
## [99,] "chr4" "140569481" "4:140569481:C:T" "C" "T"
## [100,] "chr4" "140569496" "4:140569496:C:T" "C" "T"
## [101,] "chr4" "140569510" "4:140569510:G:A" "G" "A"
## [102,] "chr4" "140569826" "4:140569826:C:T" "C" "T"
## [103,] "chr4" "140569931" "4:140569931:A:G" "A" "G"
## [104,] "chr4" "140569940" "4:140569940:T:TA" "T" "TA"
## [105,] "chr4" "140569941" "4:140569941:T:A" "T" "A"
## [106,] "chr4" "140570045" "4:140570045:T:TA" "T" "TA"
## [107,] "chr4" "140570066" "4:140570066:G:T" "G" "T"
## [108,] "chr4" "140570422" "4:140570422:C:A" "C" "A"
## [109,] "chr4" "140570429" "4:140570429:C:T" "C" "T"
## [110,] "chr4" "140570542" "4:140570542:G:GA" "G" "GA"
## [111,] "chr4" "140570604" "4:140570604:C:T" "C" "T"
## [112,] "chr4" "140570619" "4:140570619:T:C" "T" "C"
## [113,] "chr4" "140570765" "4:140570765:G:A" "G" "A"
## [114,] "chr4" "140570776" "4:140570776:G:A" "G" "A"
## [115,] "chr4" "140570850" "4:140570850:A:G" "A" "G"
## [116,] "chr4" "140570901" "4:140570901:A:G" "A" "G"
## [117,] "chr4" "140570919" "4:140570919:G:A" "G" "A"
## [118,] "chr4" "140570972" "4:140570972:G:A" "G" "A"
## [119,] "chr4" "140571067" "4:140571067:A:C" "A" "C"
## [120,] "chr4" "140571085" "4:140571085:C:A" "C" "A"
## [121,] "chr4" "140571190" "4:140571190:C:T" "C" "T"
## [122,] "chr4" "140571295" "4:140571295:T:C" "T" "C"
## [123,] "chr4" "140571482" "4:140571482:A:G" "A" "G"
## [124,] "chr4" "140571531" "4:140571531:G:C" "G" "C"
## [125,] "chr4" "140571583" "4:140571583:C:T" "C" "T"
## [126,] "chr4" "140571632" "4:140571632:G:A" "G" "A"
## [127,] "chr4" "140571789" "4:140571789:T:A" "T" "A"
## [128,] "chr4" "140571800" "4:140571800:T:A" "T" "A"
## [129,] "chr4" "140571850" "4:140571850:A:G" "A" "G"
## [130,] "chr4" "140571852" "4:140571852:A:G" "A" "G"
## [131,] "chr4" "140572036" "4:140572036:A:C" "A" "C"
## [132,] "chr4" "140572112" "4:140572112:A:G" "A" "G"
## [133,] "chr4" "140572685" "4:140572685:T:A" "T" "A"
## [134,] "chr4" "140572723" "4:140572723:G:T" "G" "T"
## [135,] "chr4" "140572807" "4:140572807:T:C" "T" "C"
## [136,] "chr4" "140572930" "4:140572930:C:G" "C" "G"
## [137,] "chr4" "140572950" "4:140572950:A:G" "A" "G"
## [138,] "chr4" "140573074" "4:140573074:A:T" "A" "T"
## [139,] "chr4" "140573151" "4:140573151:A:G" "A" "G"
## [140,] "chr4" "140573322" "4:140573322:C:CTTTTTTT" "C" "CTTTTTTT"
## [141,] "chr4" "140573322" "4:140573323:C:CTTTTTTTT" "C" "CTTTTTTTT"
## [142,] "chr4" "140573322" "4:140573324:C:CTTTCTTT" "C" "CTTTCTTT"
## [143,] "chr4" "140573322" "4:140573325:C:CTTTTTT" "C" "CTTTTTT"
## [144,] "chr4" "140573346" "4:140573346:A:C" "A" "C"
## [145,] "chr4" "140573371" "4:140573371:G:A" "G" "A"
## [146,] "chr4" "140573392" "4:140573392:C:T" "C" "T"
## [147,] "chr4" "140573402" "4:140573402:ACCTC:A" "ACCTC" "A"
## [148,] "chr4" "140573493" "4:140573493:A:T" "A" "T"
## [149,] "chr4" "140573712" "4:140573712:C:T" "C" "T"
## [150,] "chr4" "140573773" "4:140573773:A:G" "A" "G"
## QUAL FILTER INFO
## [1,] NA "PASS" NA
## [2,] NA "PASS" NA
## [3,] NA "PASS" NA
## [4,] NA "PASS" NA
## [5,] NA "PASS" NA
## [6,] NA "PASS" NA
## [7,] NA "PASS" NA
## [8,] NA "PASS" NA
## [9,] NA "PASS" NA
## [10,] NA "PASS" NA
## [11,] NA "PASS" NA
## [12,] NA "PASS" NA
## [13,] NA "PASS" NA
## [14,] NA "PASS" NA
## [15,] NA "PASS" NA
## [16,] NA "PASS" NA
## [17,] NA "PASS" NA
## [18,] NA "PASS" NA
## [19,] NA "PASS" NA
## [20,] NA "PASS" NA
## [21,] NA "PASS" NA
## [22,] NA "PASS" NA
## [23,] NA "PASS" NA
## [24,] NA "PASS" NA
## [25,] NA "PASS" NA
## [26,] NA "PASS" NA
## [27,] NA "PASS" NA
## [28,] NA "PASS" NA
## [29,] NA "PASS" NA
## [30,] NA "PASS" NA
## [31,] NA "PASS" NA
## [32,] NA "PASS" NA
## [33,] NA "PASS" NA
## [34,] NA "PASS" NA
## [35,] NA "PASS" NA
## [36,] NA "PASS" NA
## [37,] NA "PASS" NA
## [38,] NA "PASS" NA
## [39,] NA "PASS" NA
## [40,] NA "PASS" NA
## [41,] NA "PASS" NA
## [42,] NA "PASS" NA
## [43,] NA "PASS" NA
## [44,] NA "PASS" NA
## [45,] NA "PASS" NA
## [46,] NA "PASS" NA
## [47,] NA "PASS" NA
## [48,] NA "PASS" NA
## [49,] NA "PASS" NA
## [50,] NA "PASS" NA
## [51,] NA "PASS" NA
## [52,] NA "PASS" NA
## [53,] NA "PASS" NA
## [54,] NA "PASS" NA
## [55,] NA "PASS" NA
## [56,] NA "PASS" NA
## [57,] NA "PASS" NA
## [58,] NA "PASS" NA
## [59,] NA "PASS" NA
## [60,] NA "PASS" NA
## [61,] NA "PASS" NA
## [62,] NA "PASS" NA
## [63,] NA "PASS" NA
## [64,] NA "PASS" NA
## [65,] NA "PASS" NA
## [66,] NA "PASS" NA
## [67,] NA "PASS" NA
## [68,] NA "PASS" NA
## [69,] NA "PASS" NA
## [70,] NA "PASS" NA
## [71,] NA "PASS" NA
## [72,] NA "PASS" NA
## [73,] NA "PASS" NA
## [74,] NA "PASS" NA
## [75,] NA "PASS" NA
## [76,] NA "PASS" NA
## [77,] NA "PASS" NA
## [78,] NA "PASS" NA
## [79,] NA "PASS" NA
## [80,] NA "PASS" NA
## [81,] NA "PASS" NA
## [82,] NA "PASS" NA
## [83,] NA "PASS" NA
## [84,] NA "PASS" NA
## [85,] NA "PASS" NA
## [86,] NA "PASS" NA
## [87,] NA "PASS" NA
## [88,] NA "PASS" NA
## [89,] NA "PASS" NA
## [90,] NA "PASS" NA
## [91,] NA "PASS" NA
## [92,] NA "PASS" NA
## [93,] NA "PASS" NA
## [94,] NA "PASS" NA
## [95,] NA "PASS" NA
## [96,] NA "PASS" NA
## [97,] NA "PASS" NA
## [98,] NA "PASS" NA
## [99,] NA "PASS" NA
## [100,] NA "PASS" NA
## [101,] NA "PASS" NA
## [102,] NA "PASS" NA
## [103,] NA "PASS" NA
## [104,] NA "PASS" NA
## [105,] NA "PASS" NA
## [106,] NA "PASS" NA
## [107,] NA "PASS" NA
## [108,] NA "PASS" NA
## [109,] NA "PASS" NA
## [110,] NA "PASS" NA
## [111,] NA "PASS" NA
## [112,] NA "PASS" NA
## [113,] NA "PASS" NA
## [114,] NA "PASS" NA
## [115,] NA "PASS" NA
## [116,] NA "PASS" NA
## [117,] NA "PASS" NA
## [118,] NA "PASS" NA
## [119,] NA "PASS" NA
## [120,] NA "PASS" NA
## [121,] NA "PASS" NA
## [122,] NA "PASS" NA
## [123,] NA "PASS" NA
## [124,] NA "PASS" NA
## [125,] NA "PASS" NA
## [126,] NA "PASS" NA
## [127,] NA "PASS" NA
## [128,] NA "PASS" NA
## [129,] NA "PASS" NA
## [130,] NA "PASS" NA
## [131,] NA "PASS" NA
## [132,] NA "PASS" NA
## [133,] NA "PASS" NA
## [134,] NA "PASS" NA
## [135,] NA "PASS" NA
## [136,] NA "PASS" NA
## [137,] NA "PASS" NA
## [138,] NA "PASS" NA
## [139,] NA "PASS" NA
## [140,] NA "PASS" NA
## [141,] NA "PASS" NA
## [142,] NA "PASS" NA
## [143,] NA "PASS" NA
## [144,] NA "PASS" NA
## [145,] NA "PASS" NA
## [146,] NA "PASS" NA
## [147,] NA "PASS" NA
## [148,] NA "PASS" NA
## [149,] NA "PASS" NA
## [150,] NA "PASS" NA
You’ll take the row number of each duplicate POS number and place it
in the code below (for example, the position 140559713
occurs twice, so I’ll put the row number of the duplicate row - for any
row more than the first - in this case, 36, in the code
below). Notice I’ve done this for every repeated position on the list
above:
#clean out duplicate positions:
UCP1vcf@fix <- UCP1vcf@fix[-c(36,40,46,47,48,61,141,142,143),]
UCP1vcf@gt <- UCP1vcf@gt[-c(36,40,46,47,48,61,141,142,143),]
Now we’re ready to calculate the haplotypes!
#calculate haplotypes
hapResult <- vcf2hap(UCP1vcf,
hapPrefix = "H",
hetero_remove = TRUE,
na_drop = TRUE)
## Extracting gt element GT
hapSummary <- hap_summary(hapResult)
hapSummary
##
## Accssions: 5
## Sites: 38
## Indels: 9
## SNPs: 29
##
## Haplotypes: 5
## H001 1 NA18933
## H002 1 NA19099
## H003 1 NA19118
## H004 1 NA18502
## H005 1 NA19146
##
## Options:
## hapPrefix: H
## CHROM: chr4
## POS: 140554647-140573773
## hetero_remove: YES
## NA_remove: YES
##
## # A tibble: 9 × 41
## Hap `140554683` `140555762` `140556119` `140556886` `140557496` `140557525`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 CHR chr4 chr4 chr4 chr4 chr4 chr4
## 2 POS 140554683 140555762 140556119 140556886 140557496 140557525
## 3 INFO <NA> <NA> <NA> <NA> <NA> <NA>
## 4 ALLELE A/G A/G T/C G/A T/C C/G
## 5 H001 A G C A C G
## 6 H002 A G C G T G
## 7 H003 A G C G T G
As you can see from the output, {geneHapR} was able to construct five haplotypes from 38 different variant sites, including 29 biallelic SNPs and 9 indels. It also lists a representative sample for each haplotype.
You may be wondering why only 38 sites went into this when there are 150 loci in the VCF file… {geneHapR} automatically excludes loci where any variant has a minor allele frequency of less than 0.05 (i.e, sites where only one or two people have the minor allele). This conservative practice keeps us from assuming IBD when individuals might share those alleles by random mutation.
{geneHapR} also provides a prettified view of our haplotypes:
#haplotype table
plotHapTable(hapSummary)
You may notice that there are several loci that are identical across multiple haplotypes… for example, the first two loci (140554683 and 140555762) appear to segregate identically. These may represent distances in our haplotype map that have not yet been broken up by recombination; these identically segregating sites in the construction of our haplotypes can also allow us to construct haplotype networks. A haplotype network assumes that identical segregating sites are identical by descent (or inherited from an ancestor), while a segregating site with just one difference from the downstream pattern (like the C alelle in haplotype H004 at the third locus, 140556119; were it a T, it would also be identical to the pattern of the first two loci) represent a novel change. We can then construct networks based on counting changes across haplotypes.
Let’s take a look at a haplotype network for UCP1 in YRI, that might help us to understand better:
#create haplotype network
hapNet <- get_hapNet(hapSummary,
groupName = "Type")
#plot haplotype network
plotHapNet(hapNet,
size = "freq", # circle size
scale = "log2", # scale circle with 'log10(size + 1)'
cex = 0.8, # size of hap symbol
col.link = 2, # link colors
link.width = 2, # link widths
show.mutation = 2, # mutation types one of c(0,1,2,3)
legend = c(-12.5, 7)) # legend position
According to our haplotype network, it looks like H002 and H004 are the haplotypes with the most samples, and they have quite a few differences between them (each dot on the line represents a single letter change between haplotypes). H003 only has a few changes from H002 (indeed, if you look at the table, they only differ at a few loci: 140568763, 140570066, 140570422, 140572723, 140573074, 140573322). As you cans see, the other haplotypes differ in different ways and amounts, but can still be related to each other by patterns of shared differences.
We also, helpfully, can use our GFF file to map out our haplotype
variants on the EnsEMBL gene models by entering our
hapSummary, GFF, and VCF positions below:
#plot gene model
displayVarOnGeneModel(hapSummary, chrom4_gff,
Chr = "4",
startPOS = 140549241, endPOS = 140578900,
type = "pin", cex = 0.7,
CDS_h = 0.05, fiveUTR_h = 0.02, threeUTR_h = 0.01)
As you can see, our haplotype loci are spread throughout the gene region, although some are in apparent physical clusters (you may notice that those are the ones that seem most similar). Within our selected gene region, we actually have two different gene models: in green we have the exons of UCP1, and in pink we have an exon of a gene called ELMOD2.
We can potentially use this map to see where linkage equilibrium is broken, based on breaks across our differing haplotypes.
Finally, {geneHapR} can also be used to construct LD maps, this time built from our haplotype-informative variant sites:
plot_LDheatmap(hap=hapResult,
gff=chrom4_gff,
Chr=4,
start=140549241,
end=140578900,
distances="physical",
LDmeasure="r",
title="Pairwise LD in UCP1 for YRI",
add.map=TRUE,
map.height=1,
colorLegend=TRUE,
geneMapLocation=0.1,
SNP.name=TRUE,
snpmarks_height=0.1,
name="ucp1_yri_ldheatmap",
pop=FALSE,
text=FALSE)
Unfortunately, it looks like we’re unable to get our gene model to map above the LD heatmap :(
I’ll continue to troubleshoot and see if I can get it to work!
As you can see, this linkage map roughly replicates the patterns we saw in our more refined maps (more refined in that we used all 150 variant sites rather than just 38).
The last thing we’ll do in this lab is look at EnsEMBL’s
information on LD for each of the SNPs we studied today in your
populations.
To do this:
Go to EnsEMBL and search for UCP1 in the Human search bar as we have done before
Use the left sidebar to navigate to the Variant Table:
Some populations will have more SNPs in high LD than others. Look specifically at the SNPs in high LD that are not in UCP1. In the example above, there’s a SNP (rs2321273) listed in the gene ELMOD2 in high linkage with a variant in UCP1.
Repeat this process for all five SNPs we looked at in today’s module Where are the high LD SNPs that aren’t in UCP1 located? Are any of them located in other gene regions?
Think about the results you produced today in the context of your
population. Here are some guiding questions to help you:
What does it mean for two SNPs in your population to be in LD? How could two obesity-related SNPs being in LD reflect selection in that population?
Would you expect a high level of LD to develop in UCP1 in your population based on its ecological context? Why or why not?
What would LD between two UCP1-related SNPs tell you about the evolutionary history of your population?
If you found any SNPs in high LD with our UCP1 SNPs
during the EnsEMBL exercise that are in a different gene, do a
quick OMIM search on the gene(s). What
does that gene code for? How might it be connected to UCP1?